Abstract

We consider exact enumerations and probabilistic properties of 
ranked trees when generated under the random coalescent process. 
Using a new approach (see 0, E3]), based on generating functions, 
we derive several statistics such as the exact probability of finding k 
cherries in a ranked tree of fixed size n. We then extend our method to 
consider also the number of pitchforks. We find a recursive formula to 
calculate the joint and conditional probabilities of cherries and pitch- 
forks when the size of the tree is fixed. 

1 Introduction 

Given a direction by time, ancestry relationship between species, individuals, 
alleles or cells can be depicted as a rooted tree. Of particular interest are 
binary rooted unordered trees. These can be further classified into several 
subclasses. Here we will consider ranked trees, which are defined below.
We assume that trees are generated by the coalescent process. 
An important parameter is the number of cherries of a tree. By a new 
approach based on generating functions we extend previous results (see for 
example 0] ) deriving an exact formula for the probability of finding k cher- 
ries in a ranked tree of size n. Furthermore, we show that several known 
statistics (see 0) concerning pitchforks follow as corollaries from a partial
differential equation which also gives an efficient recursion to compute the 
conditional probability distribution of pitchforks given a certain number of 
cherries.

One motivation for this study comes from population genetics and the
question of what 'typical' coalescent trees [13] look like. Our results give
some insight into structural properties of trees generated under the standard
neutral model [12]. These results provide a reference against which
non-neutral and/or non-independently generated trees may be compared. To
illustrate the latter we pay attention to trees which are linked along a
recombining chromosome.



2 Preliminaries 

We start with some basic definitions. A binary rooted tree is a tree with a
root and in which all nodes have outdegree either 0 or 2. Nodes with
outdegree 2 are called internal, nodes with outdegree 0 are external. External
nodes are also called leaves. The size n of a tree is the number of its external
nodes. The subtree of an internal node i is the tree with root i. A tree is
said to be un-ordered when it is taken in the graph theoretic sense, so that
the subtrees stemming from an internal node have no left-right order among
themselves. Here, we care about tree topology and not about branch lengths.
We consider the following class. A binary un-ordered tree of size n is said
to be a ranked tree if the set of internal nodes is totally ordered by labels
belonging to {1, 2, ..., n-1} in such a way that each child's label is
greater than its parent's label (see Fig. 1). The total order of internal labels
can be interpreted as a historical time order; accordingly, Harding [3] calls
such trees histories.

We will denote by $\mathcal{R}$ the set of ranked trees and by $\mathcal{R}_n$ the set of trees
of size n. In what follows, n = n(t) always represents the number of leaves
of a ranked tree t.

The cardinality of the set $\mathcal{R}_n$ is given by the following exponential
generating function

$$U(x) = \sum_{n>0} \frac{|\mathcal{R}_n|}{(n-1)!}\, x^{n-1} = \sec(x) + \tan(x), \tag{1}$$

whose first coefficients $|\mathcal{R}_n|$ (with n > 0) are
1, 1, 1, 2, 5, 16, 61, 272, ....
Ranked trees can be bijectively mapped to 0-1-2-increasing trees (see Callan,
2005; http://www.stat.wisc.edu/~callan/notes). From this, it follows
that the numbers given by (1) correspond to sequence A000111 in Sloane
and are known as Euler numbers.
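As a quick numerical check of the coefficients listed for eq. (1), the Euler (zigzag) numbers can be generated from the classical recurrence 2 E_{k+1} = sum_j C(k, j) E_j E_{k-j}, valid for k >= 1 with E_0 = E_1 = 1. The sketch below is illustrative only: the function name `euler_zigzag` is ours, and the recurrence is a standard property of sequence A000111 rather than a construction taken from this paper.

```python
from math import comb

def euler_zigzag(count):
    """Return the first `count` Euler (zigzag) numbers E_0, E_1, ..."""
    e = [1, 1]
    for k in range(1, count - 1):
        # 2 * E_{k+1} = sum_j C(k, j) * E_j * E_{k-j}, valid for k >= 1
        total = sum(comb(k, j) * e[j] * e[k - j] for j in range(k + 1))
        e.append(total // 2)
    return e[:count]

print(euler_zigzag(8))  # [1, 1, 1, 2, 5, 16, 61, 272]
```

Under the indexing used here, the entry at position n-1 is the number of ranked trees with n leaves, so the value 16 at position 5 corresponds to the sixteen ranked trees of size six.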



Figure 1: The sixteen possible ranked trees of size six classified by shape. Within
each class all possible orderings of the internal nodes are displayed. Number of
cherries and pitchforks are indicated.

2.1 Trees as a result of the coalescent process



The coalescent of size n is a model for the genealogical history of a sample
of n genes. It was introduced in population genetics by Kingman and
Ewens and nowadays has textbook status. Ranked trees can be
generated by the coalescent process, which starts with n leaves and works
by successively coalescing two randomly chosen branches until it reaches the
'most recent common ancestor', when the last two remaining branches are
joined.

To reflect time order one can assign an integer to each internal node
when it is created, for instance the label n-1 to the first coalescent event
and 1 to the last event, the most recent common ancestor, i.e. the root of
the tree.

The probability distribution of ranked trees in $\mathcal{R}_n$ generated under the
coalescent process is essentially contained in the paper of Tajima [13] and
is described below.

Probability distribution of ranked trees

Let $t \in \mathcal{R}$ and let o(t) be the number of internal nodes i whose children
are two leaves. Such internal nodes are called the cherries of the tree
(for examples, see Fig. 1). Given $t \in \mathcal{R}_n$, it follows from Tajima [12] that

$$P(t) = \frac{2^{\,n-1-o(t)}}{(n-1)!}, \tag{2}$$

i.e. the probability of any ranked tree $t \in \mathcal{R}_n$ depends only on two
parameters, o and n.
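To make the dependence on o and n concrete, the following minimal sketch evaluates the tree probability exactly, assuming eq. (2) has the form 2^(n-1-o) / (n-1)!, which is the weight used later in the generating function F. The function name `ranked_tree_probability` is hypothetical and introduced only for illustration.

```python
from fractions import Fraction
from math import factorial

def ranked_tree_probability(n, cherries):
    """Probability of one ranked tree with n leaves and the given number of cherries."""
    # Assumed form of eq. (2): 2^(n - 1 - o) / (n - 1)!
    return Fraction(2 ** (n - 1 - cherries), factorial(n - 1))

# Size 4: one caterpillar tree (1 cherry) and one balanced tree (2 cherries).
p_caterpillar = ranked_tree_probability(4, 1)
p_balanced = ranked_tree_probability(4, 2)
print(p_caterpillar, p_balanced, p_caterpillar + p_balanced)  # 2/3 1/3 1
```

The two probabilities add up to one because these are the only two ranked trees of size four.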

The probability of generating the same ranked tree twice

Considering trees linked on a common chromosome, one observes that
chromosomal linkage substantially increases the probability that two
'neighboring' trees are identical, even if they are separated by a
recombination event. To quantify the effect of linkage and recombination it
is important to know the background probability that two independently
generated trees are identical.
This probability can be found with the help of the generating function

$$Y(x,z) = \sum_{t \in \mathcal{R}} \frac{x^{o(t)}\, z^{n(t)-1}}{(n(t)-1)!},$$

discussed in more detail in Section 3.1.1, eq. (6).
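We have the following result.

Proposition 1 The probability that two independently generated ranked trees
of size n are identical is

$$p_n = \frac{4^{\,n-1}}{(n-1)!}\,[z^{n-1}]\,Y(1/4, z).$$

Proof: From eq. (2) the probability that $t_1, t_2 \in \mathcal{R}_n$ are identical is

$$p_n = \sum_{t \in \mathcal{R}_n} \left(\frac{2^{\,n-1-o(t)}}{(n-1)!}\right)^{2} = \frac{4^{\,n-1}}{(n-1)!} \sum_{t \in \mathcal{R}_n} \frac{(1/4)^{o(t)}}{(n-1)!} = \frac{4^{\,n-1}}{(n-1)!}\,[z^{n-1}]\,Y(1/4, z),$$

where $[z^{n-1}]\,Y(1/4, z)$ means the (n-1)-st coefficient of the Taylor
expansion of $Y(1/4, z)$ at z = 0. □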

3 Enumerative results 

3.1 Outdegree of the nodes in ranked and 0-1-2-increasing trees 

Let $t \in \mathcal{R}_n$ and m = n - 1. Remove all leaves and external branches from t
to obtain a reduced tree $\rho(t)$. The tree $\rho(t)$ is a so-called 0-1-2-increasing
tree of size m, where, this time, the size is the total number of nodes in
the tree and not only the number of leaves. The class $\mathcal{I}_{012}$ of
0-1-2-increasing trees is composed of un-ordered rooted trees where all nodes
have outdegree 0, 1 or 2. The m nodes of such a tree carry totally ordered
labels belonging to {1, 2, ..., m}. Moreover, the labelling is such that any
child node label is greater than that of the parent node. As usual,
$\mathcal{I}_{012_m}$ denotes the set of 0-1-2-increasing trees of size m. Hence, the
function $\rho$ is a bijection from $\mathcal{R}_n$ to $\mathcal{I}_{012_m}$.

Given a ranked tree t, the outdegree of an internal node of t is the
outdegree of the corresponding node in $\rho(t)$. Thus, if $t \in \mathcal{R}$, the nodes of
outdegree 0 (resp. 1, 2) are defined as the nodes with 2 (resp. 1, 0) leaves
as direct descendants.




Figure 2: First levels of the generating tree associated to $\theta$.

Here, we derive the enumeration of 0-1-2-increasing trees with respect
to the size and to the number of nodes with outdegree 0, 1 and 2. The
bijection $\rho$ will allow us to use this enumerative result in Section 3.1.2 to
determine the probability distribution of the random variable o, the number
of cherries, when t is a ranked tree of size n generated by the coalescent
process. It is already known (see McKenzie) that o(t) is asymptotically
normal for large n.

3.1.1 Recursive construction of 0-1-2-increasing trees 

We now show how the class of 0-1-2-increasing trees can be generated
recursively. In particular, we construct each tree belonging to
$\mathcal{I}_{012_{m+1}}$ by adding a new node to some tree in $\mathcal{I}_{012_m}$. This
construction, denoted by $\theta$, will then be translated into a functional
equation. Solving the equation we obtain a bivariate exponential generating
function counting the considered increasing trees with respect to size and to
the number of nodes with outdegree 0, 1 and 2.

Given a tree $t \in \mathcal{I}_{012_m}$, $\theta$ simply adds the node labelled m+1 as a child
of a node of t having outdegree less than two. Let o(t), p(t) and q(t) denote
the number of nodes with outdegree 0, 1 and 2 respectively. Then $\theta$ applied to
t produces o(t) + p(t) elements of $\mathcal{I}_{012_{m+1}}$, each time adding the new node
labelled m+1 as a child of one of the nodes counted in o(t) + p(t). In Fig. 2
we depict the first steps of this construction process.

Note that o(t) = q(t) + 1 and o(t) + p(t) + q(t) = m. From these relations
we have, in particular, that p(t) = m - 2o(t) + 1. The construction $\theta$ can be
translated into the following succession rule (see Banderier et al. [1]), where
each tree is represented by a label composed of the values of its parameters
o and m, while the exponents show how many times the label is produced:

$$(o, m) \longrightarrow (o, m+1)^{o}\,(o+1, m+1)^{m-2o+1}.$$

In particular, given a tree t with parameters o = o(t) and m = m(t), the
application of $\theta$ to t produces o new trees having size m+1 and o cherries,
and m - 2o + 1 new trees having size m+1 and o+1 cherries. The starting point
of the construction is the unique tree of size one, represented by (1,1).
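The succession rule can be iterated mechanically. The sketch below is a hypothetical helper (the name `count_by_leaves` is ours), not code from the paper: it starts from the label (1, 1) and applies the rule (o, m) -> (o, m+1)^o (o+1, m+1)^(m-2o+1) level by level, keeping for each size m the number of trees with o nodes of outdegree 0.

```python
from collections import Counter

def count_by_leaves(max_size):
    """Map each size m to a Counter {o: number of 0-1-2-increasing trees with o leaves}."""
    levels = {1: Counter({1: 1})}  # the unique tree of size one, label (1, 1)
    for m in range(1, max_size):
        nxt = Counter()
        for o, count in levels[m].items():
            nxt[o] += o * count  # new node attached below a leaf: o is unchanged
            if m - 2 * o + 1 > 0:
                # new node attached below an outdegree-1 node: o grows by one
                nxt[o + 1] += (m - 2 * o + 1) * count
        levels[m + 1] = nxt
    return levels

levels = count_by_leaves(7)
print(dict(levels[5]))          # {1: 1, 2: 11, 3: 4}
print(sum(levels[5].values()))  # 16
```

The totals per level reproduce the Euler numbers 1, 1, 2, 5, 16, 61, 272, and the level m = 5 corresponds to the sixteen ranked trees of size six.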

Now consider the exponential generating function

$$Y(x,z) = \sum_{t \in \mathcal{I}_{012}} \frac{x^{o(t)}\, z^{m(t)}}{m(t)!}.$$

The previous succession rule can be translated as follows into an equation
for Y(x, z), where all sums run over $t \in \mathcal{I}_{012}$ with o = o(t) and m = m(t):

$$\begin{aligned}
Y(x,z) &= xz + \sum_{t} \frac{o\, x^{o} z^{m+1}}{(m+1)!} + \sum_{t} \frac{(m-2o+1)\, x^{o+1} z^{m+1}}{(m+1)!} \\
&= xz + (1-2x) \sum_{t} \frac{o\, x^{o} z^{m+1}}{(m+1)!} + xz \sum_{t} \frac{x^{o} z^{m}}{m!} \\
&= xz + (1-2x) \sum_{t} \frac{o\, x^{o} z^{m+1}}{(m+1)!} + xz\, Y(x,z).
\end{aligned}$$

From the previous equation we obtain that

$$\frac{Y(x,z)(1-xz) - xz}{1-2x} = \sum_{t} \frac{o\, x^{o} z^{m+1}}{(m+1)!}.$$

Differentiating both sides with respect to the variable z we have

$$\frac{1}{1-2x}\left(\frac{\partial Y}{\partial z}(x,z)\,(1-xz) - xY(x,z) - x\right) = x\,\frac{\partial Y}{\partial x}(x,z),$$

which is equivalent to

$$x(1-2x)\,\frac{\partial Y}{\partial x}(x,z) + (xz-1)\,\frac{\partial Y}{\partial z}(x,z) = -xY(x,z) - x. \tag{3}$$

The previous first order partial differential equation can be solved using
the method of characteristics, respecting the condition given by eq. (1),

$$Y(1, z) = \sec(z) + \tan(z) - 1.$$

Indeed Y(1, z) must represent the exponential generating function counting
0-1-2-increasing trees with respect to size.

Applying the method consists, first, of solving the two following ordinary
differential equations (derivatives are taken with respect to x):

$$z' = \frac{xz - 1}{x(1-2x)}, \qquad Y' = \frac{-xY - x}{x(1-2x)}.$$

The solutions are

$$z = \frac{c_1 + 2\arctan(\sqrt{2x-1})}{\sqrt{2x-1}}, \qquad Y = c_2\sqrt{2x-1} - 1, \tag{4}$$

with constants $c_1$ and $c_2$, where $c_2$ can be written as a function of $c_1$
in the following way:

$$c_2 = G(c_1) = G\!\left(z\sqrt{2x-1} - 2\arctan(\sqrt{2x-1})\right).$$

In this way equation (4) becomes

$$Y(x,z) = G\!\left(z\sqrt{2x-1} - 2\arctan(\sqrt{2x-1})\right)\sqrt{2x-1} - 1,$$

which gives

$$\sec(z) + \tan(z) - 1 = Y(1,z) = G\!\left(z - \frac{\pi}{2}\right) - 1.$$

The function G must therefore satisfy

$$G(z) = \sec\!\left(z + \frac{\pi}{2}\right) + \tan\!\left(z + \frac{\pi}{2}\right) = \frac{-1 - \cos(z)}{\sin(z)}.$$

Inserting this into (4) we have

$$Y(x,z) = \sqrt{2x-1}\left(\frac{-1 - \cos\!\left(z\sqrt{2x-1} - 2\arctan(\sqrt{2x-1})\right)}{\sin\!\left(z\sqrt{2x-1} - 2\arctan(\sqrt{2x-1})\right)}\right) - 1,$$

which, after some calculations, finally gives

$$Y(x,z) = \frac{\sqrt{2x-1}}{\tan\!\left(-\dfrac{z\sqrt{2x-1}}{2} + \arctan(\sqrt{2x-1})\right)} - 1. \tag{5}$$

Note that the condition Y(1, z) = sec(z) + tan(z) - 1 is respected.
Indeed

$$Y(1,z) = \frac{1}{\tan\!\left(-\frac{z}{2} + \frac{\pi}{4}\right)} - 1$$

and

$$\begin{aligned}
\frac{1}{\tan\!\left(-\frac{z}{2} + \frac{\pi}{4}\right)} &= \frac{1+\tan(\frac{z}{2})}{1-\tan(\frac{z}{2})} = \frac{1+\cos(z)+\sin(z)}{1+\cos(z)-\sin(z)} \\
&= \frac{1+\cos^2(z)+2\cos(z)-\sin^2(z)}{(1+\cos(z)-\sin(z))^2} \\
&= \frac{\cos(z)}{1-\sin(z)} = \frac{1+\sin(z)}{\cos(z)}.
\end{aligned}$$
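Moreover, using the fact that

$$\exp\!\left(z\sqrt{-2x+1}\right) = \cos\!\left(z\sqrt{2x-1}\right) + i\,\sin\!\left(z\sqrt{2x-1}\right),$$

we can write eq. (5) in terms of the exponential function as

$$Y(x,z) = \frac{2\left(x\exp\!\left(\sqrt{-2x+1}\,z\right) - x\right)}{\left(\sqrt{-2x+1}-1\right)\exp\!\left(\sqrt{-2x+1}\,z\right) + \sqrt{-2x+1} + 1}. \tag{6}$$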
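The exponential form of Y(x, z) can be checked numerically against the series defined by the succession rule (o, m) -> (o, m+1)^o (o+1, m+1)^(m-2o+1). The following sketch is an illustration under stated assumptions, not part of the original derivation: it restricts to real x below 1/2 so that the square root is real, truncates the series at a hypothetical cutoff `max_size`, and uses function names of our own choosing.

```python
from collections import Counter
from math import exp, factorial, sqrt

def closed_form_Y(x, z):
    """Exponential form of Y(x, z) for real x with 0 < x < 1/2."""
    w = sqrt(1 - 2 * x)
    e = exp(w * z)
    return 2 * (x * e - x) / ((w - 1) * e + w + 1)

def series_Y(x, z, max_size=30):
    """Truncated sum of x^o(t) z^m(t) / m(t)! over trees of size <= max_size."""
    level = Counter({1: 1})  # the unique tree of size one has o = 1
    total = 0.0
    for m in range(1, max_size + 1):
        total += sum(c * x ** o for o, c in level.items()) * z ** m / factorial(m)
        nxt = Counter()
        for o, c in level.items():
            nxt[o] += o * c
            if m - 2 * o + 1 > 0:
                nxt[o + 1] += (m - 2 * o + 1) * c
        level = nxt
    return total

x, z = 0.3, 0.2
print(abs(closed_form_Y(x, z) - series_Y(x, z)) < 1e-12)  # True
```

The small value of z keeps the truncated series well inside its radius of convergence, so the two evaluations agree to floating-point accuracy.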



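Performing the substitution x = 1/4 we have that

$$Y\!\left(\tfrac14, z\right) = \frac{\exp\!\left(z/\sqrt{2}\right) - 1}{2\left[\left(\tfrac{1}{\sqrt{2}}-1\right)\exp\!\left(z/\sqrt{2}\right) + \tfrac{1}{\sqrt{2}} + 1\right]},$$

the Taylor expansion of which is

$$Y\!\left(\tfrac14, z\right) = \frac{1}{4}\,z + \frac{1}{8}\,z^2 + \frac{5}{96}\,z^3 + \frac{1}{48}\,z^4 + \frac{1}{120}\,z^5 + \dots.$$

Using the result of Proposition 1 we can now effectively calculate the
probability $p_n$ that two ranked trees having n leaves are identical when
generated independently by the coalescent process:
$p_2 = \frac{4}{1!}\times\frac{1}{4} = 1$, $p_3 = \frac{4^2}{2!}\times\frac{1}{8} = 1$,
$p_4 = \frac{4^3}{3!}\times\frac{5}{96} = \frac{5}{9}$, $p_5 = \frac{4^4}{4!}\times\frac{1}{48} = \frac{2}{9}$ and
$p_6 = \frac{4^5}{5!}\times\frac{1}{120} = \frac{16}{225}$, and so on.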
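These values can be reproduced without expanding the generating function, by summing the squared tree probabilities directly. The sketch below assumes that each ranked tree with n leaves and o cherries has probability 2^(n-1-o) / (n-1)! and that the number of such trees follows the succession rule (o, m) -> (o, m+1)^o (o+1, m+1)^(m-2o+1) with m = n - 1; the function name `identical_tree_probability` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def identical_tree_probability(n):
    """Probability that two independent coalescent ranked trees with n leaves coincide."""
    level = Counter({1: 1})  # n = 2 leaves (m = 1): one tree with one cherry
    for m in range(1, n - 1):
        nxt = Counter()
        for o, c in level.items():
            nxt[o] += o * c
            if m - 2 * o + 1 > 0:
                nxt[o + 1] += (m - 2 * o + 1) * c
        level = nxt
    # Sum over all trees of the squared probability 2^(n-1-o) / (n-1)!
    return sum(c * Fraction(2 ** (n - 1 - o), factorial(n - 1)) ** 2
               for o, c in level.items())

print([str(identical_tree_probability(n)) for n in range(2, 7)])
# ['1', '1', '5/9', '2/9', '16/225']
```

Exact rational arithmetic is used so that the output can be compared digit for digit with the values obtained from the Taylor coefficients of Y(1/4, z).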

3.1.2 The probability distribution of the number of cherries 

We are now ready to state the enumeration of ranked trees with respect to
size and number of nodes of outdegree 0, 1 or 2, when each tree is weighted by
its probability under the coalescent process. This exact enumerative result
is novel and is achieved with the help of the weighted generating function

$$F(x,z) = \sum_{t \in \mathcal{R}_n,\; n>1} \frac{2^{\,n(t)-1-o(t)}}{(n(t)-1)!}\, x^{o(t)}\, z^{n(t)}.$$

The function F has a more intuitive interpretation if one considers the
transformation $Y_w = F/z$ instead. It can be interpreted as a weighted
exponential generating function counting 0-1-2-increasing trees with respect
to the outdegree and the total number of nodes.

Starting from equation (6), we perform some substitutions on Y to obtain
$Y_w$. In particular we have $Y_w = Y(x/2, 2z)$ and, multiplying by z, we finally
obtain the desired function F.

Proposition 2 The weighted ordinary generating function of ranked trees
considered with respect to size and number of cherries is

$$F(x,z) = \frac{zx\exp\!\left(2z\sqrt{-x+1}\right) - zx}{\left(\sqrt{-x+1}-1\right)\exp\!\left(2z\sqrt{-x+1}\right) + 1 + \sqrt{-x+1}}. \tag{7}$$

The probability of having o' cherries in a ranked tree of size n corresponds
to the coefficient of $x^{o'} z^n$ in the Taylor expansion of F around z = 0, i.e.

$$P_n(o = o') = [x^{o'} z^n]\,F(x,z).$$

The first terms of the Taylor expansion of (7) are as follows:

$$\begin{aligned}
F(x,z) ={}& xz^2 \\
&+ xz^3 \\
&+ \tfrac{1}{3}\,(x^2 + 2x)\,z^4 \\
&+ \tfrac{1}{3}\,(2x^2 + x)\,z^5 \\
&+ \tfrac{1}{15}\,(2x^3 + 11x^2 + 2x)\,z^6 \\
&+ \tfrac{1}{45}\,(17x^3 + 26x^2 + 2x)\,z^7 \\
&+ \tfrac{1}{315}\,(17x^4 + 180x^3 + 114x^2 + 4x)\,z^8 \\
&+ \dots.
\end{aligned}$$

Looking at Fig. 1 one can check that, for example, there are exactly 11
trees represented by the monomial $x^2 z^6$. Each one of them has probability
1/15. This is in agreement with the term $\frac{11}{15}x^2 z^6$ in the expansion. Indeed,
11/15 is the probability to obtain a ranked tree of size 6 with two cherries.
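The coefficients of this expansion can be recomputed exactly from tree counts. The sketch below is a hedged illustration: it assumes the tree weight 2^(n-1-o) / (n-1)! and the succession rule (o, m) -> (o, m+1)^o (o+1, m+1)^(m-2o+1) with m = n - 1, and the function name `cherry_distribution` is ours rather than the paper's.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def cherry_distribution(n):
    """Return {o: P_n(o)} for ranked trees with n >= 2 leaves under the coalescent."""
    level = Counter({1: 1})  # m = n - 1 = 1: a single cherry
    for m in range(1, n - 1):
        nxt = Counter()
        for o, c in level.items():
            nxt[o] += o * c
            if m - 2 * o + 1 > 0:
                nxt[o + 1] += (m - 2 * o + 1) * c
        level = nxt
    # Multiply the number of trees with o cherries by the probability of each such tree.
    return {o: c * Fraction(2 ** (n - 1 - o), factorial(n - 1))
            for o, c in sorted(level.items())}

print({o: str(p) for o, p in cherry_distribution(6).items()})
# {1: '2/15', 2: '11/15', 3: '2/15'}
```

For n = 6 the output matches the coefficient of z^6 in the expansion, and the probabilities for any n sum to one.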

Using the result of Proposition 2 we compute the discrete probability
distribution of the random variable o(t) for trees of fixed size n. In this case
o is a random variable which takes values between 1 and $\lfloor n/2 \rfloor$. In Fig. 3
we have depicted the distribution of o for a ranked tree of size n = 54.

By Proposition 2 one can also determine the expected value $E_o(n)$ and
the variance $\mathrm{Var}_o(n)$ of the random variable o as functions of the tree
size n. Using other methods these have been determined before, for example by
McKenzie.

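Using our approach the expectation is

$$E_o(n) = [z^n]\,\frac{\partial F}{\partial x}(1,z) = [z^n]\,\frac{z^2\,(z^2 - 3z + 3)}{3\,(z-1)^2}.$$

If n > 2, this simplifies to

$$E_o(n) = \frac{n}{3}.$$

The second moment is

$$\begin{aligned}
E_{o^2}(n) &= [z^n]\,\frac{\partial\!\left(x\,\frac{\partial F}{\partial x}\right)}{\partial x}(1,z) = [z^n]\left(\frac{\partial^2 F}{\partial x^2} + \frac{\partial F}{\partial x}\right)(1,z) \\
&= [z^n]\left(\frac{2\,(z^7 - 6z^6 + 15z^5 - 15z^4)}{45\,(z-1)^3} + \frac{z^2\,(z^2 - 3z + 3)}{3\,(z-1)^2}\right).
\end{aligned}$$

If n > 6, and using $\mathrm{Var}_o(n) = E_{o^2}(n) - E_o^2(n)$, we obtain the variance of o:

$$\begin{aligned}
\mathrm{Var}_o(n) &= -\frac{(n-5)(n-6)}{45} + \frac{2(n-4)(n-5)}{15} - \frac{(n-3)(n-4)}{3} + \frac{(n-2)(n-3)}{3} + \frac{n}{3} - \frac{n^2}{9} \\
&= \frac{2n}{45}.
\end{aligned}$$

Note that this is the variance of the number of cherries of independently
generated trees. Considering 'linked' trees, i.e. trees along a recombining
chromosome, the variance is smaller.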
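The closed forms for the mean and the variance of the number of cherries, n/3 and 2n/45, can be confirmed with exact arithmetic for individual tree sizes. The sketch below is illustrative and rests on the same assumptions as before (tree weight 2^(n-1-o) / (n-1)! and the succession rule for tree counts); the helper names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def cherry_distribution(n):
    """Return {o: P_n(o)} for ranked trees with n >= 2 leaves under the coalescent."""
    level = Counter({1: 1})
    for m in range(1, n - 1):
        nxt = Counter()
        for o, c in level.items():
            nxt[o] += o * c
            if m - 2 * o + 1 > 0:
                nxt[o + 1] += (m - 2 * o + 1) * c
        level = nxt
    return {o: c * Fraction(2 ** (n - 1 - o), factorial(n - 1)) for o, c in level.items()}

def cherry_mean_and_variance(n):
    """Exact mean and variance of the number of cherries for tree size n."""
    dist = cherry_distribution(n)
    mean = sum(o * p for o, p in dist.items())
    second_moment = sum(o * o * p for o, p in dist.items())
    return mean, second_moment - mean ** 2

for n in (7, 10, 54):
    mean, var = cherry_mean_and_variance(n)
    # Compare with the closed forms n/3 and 2n/45.
    print(n, mean == Fraction(n, 3), var == Fraction(2 * n, 45))
```

The size n = 54 is included because it is the tree size used for the plotted distributions.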

3.2 The number of pitchforks 

The recursive construction presented in Section 3.1.1 can be extended in
order to consider also pitchforks.

Pitchforks have been studied before using different methods, for example by
Rosenberg [10]. A pitchfork in a ranked (resp. 0-1-2-increasing) tree is
simply a subtree with 3 leaves (resp. 2 nodes). If r(t) denotes the number
of pitchforks in $t \in \mathcal{I}_{012}$, the construction of Section 3.1.1 is extended
to the new random variable r. We find the following succession rule:

$$\begin{aligned}
(o, r, m) &\longrightarrow (o, r, m+1)^{r}\,(o, r+1, m+1)^{o-r} \\
(o, r, m) &\longrightarrow (o+1, r-1, m+1)^{r}\,(o+1, r, m+1)^{m-2o+1-r}.
\end{aligned}$$
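The extended rule can be iterated in the same way as the two-parameter one. The sketch below is a hypothetical illustration (the function name is ours): it assumes the starting label (o, r, m) = (1, 0, 1) for the single-node tree and the four multiplicities r, o - r, r and m - 2o + 1 - r of the succession rule.

```python
from collections import Counter

def count_by_leaves_and_pitchforks(size):
    """Return a Counter {(o, r): number of 0-1-2-increasing trees of the given size}."""
    level = Counter({(1, 0): 1})  # single node: one leaf, no pitchfork
    for m in range(1, size):
        nxt = Counter()
        for (o, r), c in level.items():
            moves = (
                ((o, r), r),                          # below the leaf of a pitchfork
                ((o, r + 1), o - r),                  # below a leaf outside any pitchfork
                ((o + 1, r - 1), r),                  # below the top node of a pitchfork
                ((o + 1, r), m - 2 * o + 1 - r),      # below another outdegree-1 node
            )
            for label, multiplicity in moves:
                if multiplicity > 0:
                    nxt[label] += multiplicity * c
        level = nxt
    return level

print(sorted(count_by_leaves_and_pitchforks(5).items()))
# [((1, 1), 1), ((2, 0), 1), ((2, 1), 7), ((2, 2), 3), ((3, 0), 4)]
```

The sixteen trees of size m = 5 split into five (o, r) classes, and summing over r recovers the counts 1, 11 and 4 by number of cherries.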

Considering now

$$Y(x,v,z) = \sum_{t \in \mathcal{I}_{012}} \frac{x^{o(t)}\, v^{r(t)}\, z^{m(t)}}{m(t)!},$$

we obtain the following differential equation:

$$(v+x)(v-1)\,\frac{\partial Y}{\partial v} = x + xY + x(v-2x)\,\frac{\partial Y}{\partial x} + (xz-1)\,\frac{\partial Y}{\partial z}. \tag{8}$$

For v = 1 it reduces to eq. (3), but there is no easy analytic solution.

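However, we can still obtain the expected value $E_r(m)$ for the number
of pitchforks in 0-1-2-increasing trees with m nodes. Starting from (8) and
performing the substitutions x = 1/2 and z = 2z we obtain

$$\frac{\partial Y}{\partial v}\!\left(\tfrac12, v, 2z\right) = \frac{1 + Y\!\left(\tfrac12, v, 2z\right) + 2(z-1)\,\frac{\partial Y}{\partial z}\!\left(\tfrac12, v, 2z\right)}{2\left(v+\tfrac12\right)(v-1)} + \frac{\frac{\partial Y}{\partial x}\!\left(\tfrac12, v, 2z\right)}{2\left(v+\tfrac12\right)},$$

from which, for m > 0, we have

$$[z^m]\,\frac{\partial Y}{\partial v}\!\left(\tfrac12, v, 2z\right) = [z^m]\,\frac{Y\!\left(\tfrac12, v, 2z\right) + 2(z-1)\,\frac{\partial Y}{\partial z}\!\left(\tfrac12, v, 2z\right)}{2\left(v+\tfrac12\right)(v-1)} + [z^m]\,\frac{\frac{\partial Y}{\partial x}\!\left(\tfrac12, v, 2z\right)}{2\left(v+\tfrac12\right)}.$$

When v → 1 we find that

$$E_r(m) = [z^m]\,\lim_{v \to 1} \frac{Y\!\left(\tfrac12, v, 2z\right) + 2(z-1)\,\frac{\partial Y}{\partial z}\!\left(\tfrac12, v, 2z\right)}{2\left(v+\tfrac12\right)(v-1)} + \frac{2E_o(m)}{3}.$$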

The limit can be determined according to l'Hôpital's rule, taking the
derivative of the numerator and of the denominator with respect to v and then
performing the substitution v = 1. Furthermore, from Section 3.1.2,
$E_o(m) = (m+1)/3$, and thus

$$\begin{aligned}
E_r(m) &= [z^m]\,\frac{1}{3} \sum_{t} r(t)\,\frac{2^{\,m(t)-o(t)}}{m(t)!}\, z^{m(t)} + [z^m]\,\frac{2}{3}\,(z-1) \sum_{t} r(t)\, m(t)\,\frac{2^{\,m(t)-1-o(t)}}{m(t)!}\, z^{m(t)-1} + \frac{2(m+1)}{9} \\
&= \frac{1}{3}\,E_r(m) + [z^m]\,\frac{1}{3}\,(z-1) \sum_{k>0} k\,E_r(k)\, z^{k-1} + \frac{2(m+1)}{9} \\
&= \frac{1}{3}\,E_r(m) + \frac{m\,E_r(m) - (m+1)\,E_r(m+1)}{3} + \frac{2(m+1)}{9}.
\end{aligned}$$

Reordering terms we obtain the recursion

$$E_r(2) = 1; \qquad (m+1)\,E_r(m+1) = (m-2)\,E_r(m) + \frac{2(m+1)}{3}.$$

This gives for an increasing tree with m > 2 nodes

$$E_r(m) = \frac{m+1}{6}.$$
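The recursion for the expected number of pitchforks is easy to iterate exactly. The sketch below assumes the recursion in the form E_r(2) = 1 and (m+1) E_r(m+1) = (m-2) E_r(m) + 2(m+1)/3, and compares the result with the closed form (m+1)/6; the function name `expected_pitchforks` is hypothetical.

```python
from fractions import Fraction

def expected_pitchforks(max_size):
    """Return {m: E_r(m)} for 2 <= m <= max_size, iterating the recursion exactly."""
    expectation = {2: Fraction(1)}
    for m in range(2, max_size):
        # (m + 1) E_r(m + 1) = (m - 2) E_r(m) + 2 (m + 1) / 3
        expectation[m + 1] = ((m - 2) * expectation[m] + Fraction(2 * (m + 1), 3)) / (m + 1)
    return expectation

values = expected_pitchforks(10)
print(all(values[m] == Fraction(m + 1, 6) for m in range(3, 11)))  # True
```

The starting value E_r(2) = 1 is the only one that does not follow the closed form, which is why the statement is restricted to m > 2.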



From eq. (8) one can also compute the full probability distribution of
the random variable r when an increasing tree of fixed size is generated by
the coalescent process. Indeed, if we consider

$$Y_m(x,v,z) = \sum_{t \in \mathcal{I}_{012_m}} \frac{x^{o(t)}\, v^{r(t)}\, z^{m}}{m!},$$

the following result provides a recursion which can be used to compute
the functions $Y_m$ for any m > 1.

Proposition 3 The following recursion holds:

$$Y_1 = xz, \qquad Y_{m+1} = \int \left[(v+x)(1-v)\,\frac{\partial Y_m}{\partial v} + xY_m + x(v-2x)\,\frac{\partial Y_m}{\partial x} + xz\,\frac{\partial Y_m}{\partial z}\right] dz.$$

Proof. Consider eq. (8) without the monomial x which appears there. If we
then isolate the term $\partial Y / \partial z$ and integrate both sides of the resulting
equation with respect to the variable z, we obtain the polynomial $Y_{m+1}$
starting from $Y = Y_m$. □

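The results for m = 1, 2, 3, 4, 5 are as follows:

$$\begin{aligned}
Y_1 &= xz \\
Y_2 &= \tfrac{1}{2}\, vxz^2 \\
Y_3 &= \tfrac{1}{6}\, vxz^3 + \tfrac{1}{6}\, x^2z^3 \\
Y_4 &= \tfrac{1}{24}\, vxz^4 + \tfrac{1}{24}\, x^2z^4 + \tfrac{1}{8}\, vx^2z^4 \\
Y_5 &= \tfrac{1}{120}\, vxz^5 + \tfrac{1}{120}\, x^2z^5 + \tfrac{7}{120}\, vx^2z^5 + \tfrac{1}{40}\, v^2x^2z^5 + \tfrac{1}{30}\, x^3z^5
\end{aligned}$$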
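The differential recursion for the polynomials Y_m can be carried out with a few lines of polynomial arithmetic. The sketch below is an illustration under an explicit assumption: writing Y_m = P_m(x, v) z^m / m!, the integration over z turns the recursion into P_{m+1} = (v+x)(1-v) dP_m/dv + x P_m + x(v-2x) dP_m/dx + m x P_m. Polynomials are stored as dictionaries, and all helper names (`mul`, `add`, `diff`, `next_poly`) are ours.

```python
from fractions import Fraction
from math import factorial

# A polynomial in x and v is a dict {(a, b): coefficient}, meaning coefficient * x^a * v^b.

def mul(p, q):
    out = {}
    for (a1, b1), c1 in p.items():
        for (a2, b2), c2 in q.items():
            key = (a1 + a2, b1 + b2)
            out[key] = out.get(key, 0) + c1 * c2
    return out

def add(*polys):
    out = {}
    for p in polys:
        for key, c in p.items():
            out[key] = out.get(key, 0) + c
    return {key: c for key, c in out.items() if c != 0}

def diff(p, var):
    """Partial derivative with respect to x (var=0) or v (var=1)."""
    out = {}
    for key, c in p.items():
        if key[var] > 0:
            new = list(key)
            new[var] -= 1
            out[tuple(new)] = out.get(tuple(new), 0) + c * key[var]
    return out

X, V = {(1, 0): 1}, {(0, 1): 1}

def next_poly(p, m):
    """Given P_m with Y_m = P_m z^m / m!, return P_{m+1}."""
    one_minus_v = {(0, 0): 1, (0, 1): -1}
    v_minus_2x = {(0, 1): 1, (1, 0): -2}
    x_times_p = mul(X, p)
    return add(
        mul(mul(add(V, X), one_minus_v), diff(p, 1)),  # (v + x)(1 - v) dY/dv
        x_times_p,                                     # x Y
        mul(mul(X, v_minus_2x), diff(p, 0)),           # x (v - 2x) dY/dx
        {key: m * c for key, c in x_times_p.items()},  # x z dY/dz contributes m x P_m
    )

p = dict(X)  # P_1 = x, i.e. Y_1 = x z
for m in range(1, 5):
    p = next_poly(p, m)
y5 = {key: Fraction(c, factorial(5)) for key, c in p.items()}
print(y5[(2, 1)], y5[(2, 2)], y5[(3, 0)])  # 7/120 1/40 1/30
```

The printed values are the coefficients of v x^2 z^5, v^2 x^2 z^5 and x^3 z^5 in Y_5.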

The above results concerning cherries and pitchforks can be extended to
the joint and conditional probability distributions (see Fig. 4). Summarizing,
we state

Proposition 4 i) The probability of having r' pitchforks in an increasing
tree of size m (see Fig. 3) is

$$P_m(r = r') = [v^{r'}]\,Y_m\!\left(\tfrac12, v, 2\right);$$

ii) The probability of having o' cherries and r' pitchforks in an increasing
tree of size m is

$$P_m(o = o', r = r') = [x^{o'} v^{r'}]\,Y_m\!\left(\tfrac{x}{2}, v, 2\right);$$

iii) The probability of having r' pitchforks in an increasing tree of size m
given it has o' cherries (see Fig. 4) is

$$P_m(r = r' \mid o = o') = \frac{P_m(o = o', r = r')}{P_m(o = o')} = \frac{[x^{o'} v^{r'}]\,Y_m\!\left(\tfrac{x}{2}, v, 2\right)}{[x^{o'}]\,Y_m\!\left(\tfrac{x}{2}, 1, 2\right)}.$$
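The joint and conditional distributions can be tabulated exactly for small sizes. The sketch below is a hedged illustration, not the authors' implementation: it counts trees by (o, r) with the pitchfork succession rule, assuming the starting label (1, 0, 1), and weights each tree of size m with o leaves by 2^(m-o) / m!, which is the coefficient produced by evaluating Y_m at (x/2, v, 2). The function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def joint_distribution(m):
    """Return {(o, r): P_m(o, r)} for 0-1-2-increasing trees of size m."""
    level = Counter({(1, 0): 1})
    for size in range(1, m):
        nxt = Counter()
        for (o, r), c in level.items():
            moves = (((o, r), r), ((o, r + 1), o - r),
                     ((o + 1, r - 1), r), ((o + 1, r), size - 2 * o + 1 - r))
            for label, multiplicity in moves:
                if multiplicity > 0:
                    nxt[label] += multiplicity * c
        level = nxt
    # Each tree with o leaves carries the weight 2^(m - o) / m!.
    return {key: c * Fraction(2 ** (m - key[0]), factorial(m)) for key, c in level.items()}

def conditional_pitchforks(m, o):
    """Return {r: P_m(r | o)}, the distribution of pitchforks given o cherries."""
    joint = joint_distribution(m)
    marginal = sum(p for (oo, _), p in joint.items() if oo == o)
    return {r: p / marginal for (oo, r), p in joint.items() if oo == o}

print({r: str(p) for r, p in sorted(conditional_pitchforks(5, 2).items())})
# {0: '1/11', 1: '7/11', 2: '3/11'}
```

For m = 5 and two cherries, the eleven trees split as 1, 7 and 3 according to the number of pitchforks, which gives the conditional probabilities shown.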
Figure 3: Distributions of cherries and pitchforks for $\mathcal{R}_{54}$ (i.e. $\mathcal{I}_{012_{53}}$).

Figure 4: Mean of the conditional probability distribution of pitchforks given the
number of cherries for $\mathcal{R}_{54}$. Horizontal axis: number of cherries.